Work distribution

All group members have contributed to the production of all elements of the project, including the explainer notebook and the website. Below, we outline who was mainly responsible for each major part of the project:

1. Motivation

The datasets

The Reddit dataset

For this project, we chose to work with data from the r/politics subreddit, an online forum with 8 million members intended "for current and explicitly political U.S. news", according to the rules stated on the site.

Visitors to r/politics will quickly notice that the majority of the submissions are users posting links to news articles published on media sites like CNN or The Huffington Post. The headlines of these linked articles are then shown on r/politics as the titles of the submissions. Other users can then comment on the linked article, which ultimately constitutes the actual user-generated content on the site.

We focused our data extraction to only include submissions from r/politics that fulfilled the following criteria:

Submission variables

The downloaded submissions would be structured in a Pandas dataframe containing the following variables for each submission in its respective columns:

  1. Time stamp: When the submission was made.
  2. Title: The title of the submission, usually the headline of the linked article.
  3. ID: A unique identifier for a particular submission.
  4. Author: The profile name of the submission's author.
  5. Number of comments: The number of comments received on the particular submission.
  6. URL: The link stated in the text of the submission.
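As a rough illustration of the structure described above, the submissions dataframe could be sketched as follows (the column names and the single example row are assumptions for this sketch, not the notebook's exact schema):

```python
import pandas as pd

# One illustrative row; column names are assumptions for this sketch.
submissions = pd.DataFrame({
    "time_stamp": ["2020-10-20 14:02:11"],
    "title": ["Example headline about the election"],
    "id": ["abc123"],
    "author": ["some_redditor"],
    "num_comments": [42],
    "url": ["https://example.com/article"],
})
```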

Our query to extract comments from the r/politics subreddit was likewise restricted to comments fulfilling the following criteria:

Comments variables

As stated, we would download the associated comments section for all the downloaded submissions. Similarly to the submissions, the comments would be structured in a Pandas dataframe containing the following variables for each comment in its respective columns:

  1. time stamp index: When the comment was made.
  2. id: A unique identifier for a particular comment.
  3. link_id: A unique identifier for the original submission under whose comments section the comment appears.
  4. author: The profile name of the comment's author.
  5. parent_id: A unique identifier for the post to which the comment was made. This may be either a submission or a comment, indicated by the parent_id starting with "t3_" or "t1_" respectively.
  6. body: The textual content of the comment.
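The prefix convention on `parent_id` can be encoded in a small helper like the one below (a sketch; the function name is ours, only the "t1_"/"t3_" prefixes come from Reddit's fullname scheme):

```python
def parent_kind(parent_id):
    """Classify a comment's parent from its Reddit fullname prefix:
    "t3_" marks a submission, "t1_" marks another comment."""
    if parent_id.startswith("t3_"):
        return "submission"
    if parent_id.startswith("t1_"):
        return "comment"
    return "unknown"

parent_kind("t3_kfg2x1")  # a top-level comment replying to a submission
parent_kind("t1_gh5om2")  # a reply to another comment
```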

Reasons for choosing this particular data set

  1. Easy to collect using the Pushshift API.
  2. Interesting topic that would fit the requirements of the project.
  3. Similar format to the data we've previously worked with.

The polling dataset

The polling data used in this project is collected from FiveThirtyEight. The data set contains polling results from our period of interest, generated by different polling firms like YouGov and SurveyMonkey. Each poll was conducted over a smaller time window, typically spanning a couple of days, and with a varying number of respondents. At the end of each poll, the percentage of respondents favoring a specific candidate was registered. We have not investigated further how the individual polling firms conducted the polls, but we put our faith in their ability to make the polls representative of the American population.

Goal for end user's experience

Our goal is for the end users of our website to find the results of our analyses interesting.

Downloading the Reddit data

To ensure that each submission is unambiguously related to one of the candidates, we simply remove all submissions containing both "Trump" and "Biden".
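Assuming the submissions sit in a Pandas dataframe (dataframe and column names here are illustrative), this filtering could look like:

```python
import pandas as pd

# Illustrative titles; the real dataframe comes from the Pushshift query.
subs = pd.DataFrame({"title": [
    "Trump rallies in Florida",
    "Biden leads in new poll",
    "Trump and Biden clash in final debate",  # ambiguous -> removed
]})

both = (subs["title"].str.contains("Trump")
        & subs["title"].str.contains("Biden"))
subs = subs[~both]  # keep only unambiguous submissions
```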

Now we're ready to download the associated comments sections for each of the remaining submissions.

2. Cleaning, preprocessing and stats

Processing the submissions

Determining the mentioned politician
We need to determine whether the collected submissions relate to Trump or to Biden; recall that we have already removed all submissions including both names. We do this simply by searching the title of each submission for these names and adding a "politician" variable to each submission stating which of the politicians is mentioned.
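A minimal sketch of this assignment, assuming the ambiguous submissions have already been removed (dataframe and column names are illustrative):

```python
import pandas as pd

# Ambiguous submissions (mentioning both names) were removed earlier,
# so each title matches exactly one candidate.
subs = pd.DataFrame({"title": ["Trump rallies in Florida",
                               "Biden leads in new poll"]})
subs["politician"] = subs["title"].apply(
    lambda t: "Trump" if "Trump" in t else "Biden")
```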

Cleaning and preprocessing the comments

Removing deleted comments
Some of the comments have been removed after being posted, so we'll do some cleaning first by filtering out the comments where author = "[deleted]". Similarly, we found that a large part of the comments were made by moderator bots reminding real redditors to behave in accordance with the subreddit rules. Comments made by these bots are also removed.

Removing comments by authors with less than 50 comments
We also remove all comments by authors who have posted fewer than 50 comments in total. The reason is that we would like a more solid foundation on which to infer the political convictions of the redditors, which we would not have with only very few comments per author.
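This author-level filter can be sketched with a `value_counts` mask (names illustrative, data toy-sized):

```python
import pandas as pd

# Toy comments table (column names illustrative)
comments = pd.DataFrame({
    "author": ["alice"] * 60 + ["bob"] * 3,
    "body": ["some comment text"] * 63,
})

# keep only comments by authors with at least 50 comments in total
counts = comments["author"].value_counts()
active_authors = counts[counts >= 50].index
comments = comments[comments["author"].isin(active_authors)]
```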

Finding parent authors for the comments
To later build our network of redditors, we will populate the comments dataframe with the names of the parent authors. This will create some [deleted] and NaN values which we will delete.

Tokenization
Since we will also produce textual analyses other than the VADER sentiment scores, such as word clouds, we need to do some tokenization ourselves. For the body of each comment, we will do the following steps:

The results of this preprocessing will be a new column in the comments dataframe containing the cleaned tokens of the text body.
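The concrete steps are listed in the notebook; a minimal sketch of a typical pipeline of this kind (lowercasing, punctuation removal, stop-word filtering - our assumption of the steps, not the notebook's exact recipe) might look like:

```python
import re

# A tiny illustrative stop-word list; the notebook presumably uses a
# fuller one (e.g. from nltk).
STOPWORDS = {"the", "a", "is", "to", "and", "of"}

def tokenize(body):
    # lowercase, replace anything that is not a letter with a space,
    # split on whitespace, drop stop words
    words = re.sub(r"[^a-z\s]", " ", body.lower()).split()
    return [w for w in words if w not in STOPWORDS]

tokenize("The election IS coming to a close!")
# -> ['election', 'coming', 'close']
```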

Determining whether the comments are "Trump"- or "Biden"-related
For each of the collected comments, we need to know whether the comment relates to Trump or Biden. Our plan is to first assign to every comment a "politician" variable matching the "politician" value of the submission under which the comment was made.

This assignment of "politician" values is quite naive, as it does not take into account that the comments themselves may mention either of the politicians. To overcome this, we prune the comments section tree: whenever a comment mentions a different politician than what is currently stated as its "politician" value, we remove this comment along with all comments made in reply to it, the so-called children of the comment. This way, we ensure an unambiguous picture of which politician each comment relates to.

This procedure requires us to also know the children of each comment, which we find below:
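A sketch of how such a children mapping and the pruning set could be built (identifiers and structure are hypothetical, relying only on the "t1_"/"t3_" parent_id prefixes described earlier):

```python
from collections import defaultdict

# Hypothetical (id, parent_id) pairs: a "t3_" prefix marks a submission
# parent, a "t1_" prefix marks a comment parent.
comment_parents = [
    ("c1", "t3_sub1"),  # top-level comment on submission sub1
    ("c2", "t1_c1"),    # reply to c1
    ("c3", "t1_c1"),    # reply to c1
    ("c4", "t1_c2"),    # reply to c2
]

# map each comment to its direct children
children = defaultdict(list)
for comment_id, parent_id in comment_parents:
    if parent_id.startswith("t1_"):
        children[parent_id[3:]].append(comment_id)

def descendants(comment_id):
    """All comments below comment_id in the tree (removed when pruning)."""
    out = []
    for child in children[comment_id]:
        out.append(child)
        out.extend(descendants(child))
    return out
```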

Now we will figure out whether each comment itself mentions "Trump" or "Biden".

Statistics of the submissions

We'll look into the distribution of the submissions mentioning either Trump or Biden.

Statistics of the comments

Statistics of polling data

3. Tools, theory and analysis

Working with text

We wanted to use the text in the downloaded comments dataset for two reasons:

  1. To compute the daily sentiment for Biden and Trump and investigate whether there would be any interesting differences between the words used in Biden- and Trump-related comments respectively.
  2. To create a binary partitioning of the comment authors based on their expressed sentiment towards the two presidential candidates.

We chose to work with two different dictionary-based sentiment scoring methods. To compute the daily sentiment for Biden and Trump, which we would compare with the collected polling data, we used the Hedonometer word list, which is essentially a large collection of words with associated average happiness scores as judged by people on Amazon's Mechanical Turk. To create the binary partitioning of the comment authors, we used the similar Valence Aware Dictionary and sEntiment Reasoner (VADER) module from the nltk.sentiment library, which has been created specifically to work with text produced on social media. One of the great features of this module is that it is quite robust with respect to the data cleaning and processing needed to function properly. Typical processing steps like tokenization, stemming and stop-word removal are consequently not required for the VADER module to work well and provide an indication of the sentiment of a body of text.

1. Computing the daily sentiment for Biden and Trump.

Grouping the data by the associated politician and making it into a single corpus

NOTE: Here, we've ended up deleting even more comments, leaving some authors with fewer than five comments.

For each author, we will assign a "supporter_of" attribute, corresponding to the candidate for which the concatenated related comments have the highest compound sentiment score.
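This assignment step can be sketched as below; the compound scores here are made-up stand-ins for the VADER output (the author names and values are hypothetical):

```python
# Hypothetical per-author VADER compound scores, one for the concatenated
# Trump-related comments and one for the Biden-related comments.
compound_scores = {
    "author_a": {"Trump": -0.42, "Biden": 0.61},
    "author_b": {"Trump": 0.35, "Biden": -0.10},
}

# Each author is assigned to the candidate whose concatenated comments
# received the highest compound score.
supporter_of = {author: max(scores, key=scores.get)
                for author, scores in compound_scores.items()}
```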

Word Clouds

Validation of dictionary based methods

In order for the dictionary-based methods to work, the rule of thumb is to have at least 10,000 tokens per document. We group the comments daily over the period and plot the token counts together with the desired cutoff.

We disregard the last 3 days in the analysis going forward, partly because they do not meet the criteria for the methods used, but also because of the mismatch in daily comment coverage between the candidates, e.g. Trump has comments on 2020-11-04 while Biden does not, and vice versa on other days.

Happiness scores using the Hedonometer lexicon.

We load in the Hedonometer data regarding happiness scores for words and apply it in the analysis of the daily documents in the corpus.
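Conceptually, the daily happiness score is an average of the lexicon scores of the tokens appearing that day. A minimal sketch (the lexicon slice below is a toy stand-in with invented values, not actual Hedonometer scores):

```python
# A toy slice of a Hedonometer-style lexicon (word -> average happiness
# on a 1-9 scale); the real word list contains thousands of words scored
# by Mechanical Turk raters.
happiness = {"love": 8.42, "win": 7.58, "hate": 2.34, "vote": 5.68}

def daily_happiness(tokens):
    """Average happiness over the tokens that appear in the lexicon."""
    scored = [happiness[t] for t in tokens if t in happiness]
    return sum(scored) / len(scored) if scored else None

daily_happiness(["love", "vote", "landslide"])
# -> 7.05 (mean of the two scored words; "landslide" is not in the toy lexicon)
```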

Here we convert the dataset to another, more malleable format, i.e. Pandas dataframes with the desired indexing.

The values are dropped in line with the previous comments regarding the cut-off at the tail end of the period, following the token-count criterion for dictionary-based methods.

Word Shifterator plots

We chose election day as the day to remark upon and compare it against the entire preceding part of the dataset to investigate changes of sentiment over time. This amounts to a total of 33 days contained in the reference dictionary.

So the main topics center on Trump and people; on election day itself, however, voting becomes more prominent, understandably so.

The $\delta\phi$ value identifies the extent to which a given set of words contributes to the difference in scores, and how: a word is used more often in one text than in the other, and its influence, positive or negative, is weighted by its relative frequency. This matches the Word Cloud representation of the texts very well.

The top words produce an overall negative net change in $\phi_{avg}$.

Network science tools

In order to model the interactions of the redditors with one another, we constructed an undirected network based on their comments and replies, where the nodes were the authors. We then created reciprocal edges such that each edge represented that the two authors had replied to each other at least once. Afterwards, we assigned an edge weight based on the total number of comments the two authors had written to one another. Finally, we removed all singleton nodes and self-loops from the graph. These choices were made because we wanted to focus on discussions where both users interacted with one another and contributed to a dialogue, in order to draw conclusions about political discussions online. In that case, making the network undirected simplifies it and makes it easier to plot.

One of our goals was to investigate whether authors primarily interacted with other users of similar political conviction. As a preliminary analysis, we therefore calculate the fraction of edge weights between authors with the same political conviction. We find that around 56% of the comments were written to a user who shared the political conviction of the author. This could suggest that users communicate slightly more often with people who share their political beliefs, as this might be easier and lead to a less polarizing discussion. However, they still interact relatively broadly, often with someone they disagree with politically, as political disagreement might give users a larger incentive to reply to a comment. It is also important to keep in mind that there are some shortcomings in our assumptions about the users' opinions, as they were inferred from sentiment and do not represent a ground truth. They might therefore differ from the users' true political opinions, which could affect our later analysis.
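The fraction described above can be sketched on a toy networkx graph (attribute and node names are illustrative, not the project's actual data):

```python
import networkx as nx

# Toy network: the "supporter_of" node attribute is the inferred political
# conviction, the edge weight the number of mutual comments.
G = nx.Graph()
G.add_node("alice", supporter_of="Biden")
G.add_node("bob", supporter_of="Biden")
G.add_node("carol", supporter_of="Trump")
G.add_edge("alice", "bob", weight=3)    # same conviction
G.add_edge("alice", "carol", weight=2)  # different conviction

same = sum(d["weight"] for u, v, d in G.edges(data=True)
           if G.nodes[u]["supporter_of"] == G.nodes[v]["supporter_of"])
total = sum(d["weight"] for _, _, d in G.edges(data=True))
fraction = same / total  # 3 / 5 = 0.6 in this toy graph
```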

We then visualised the network using the Netwulf package. Nodes with the political conviction attribute "Trump" were colored red, while nodes with the attribute "Biden" were colored blue. We also made the size of each node proportional to its strength, i.e. the sum of the weights of the node's edges. Lastly, we displayed the names of the 8 users with the largest degree as text labels.

From the graph we observe that the red and blue nodes are quite interconnected, and there doesn't seem to be two distinct communities for the supporters of Trump and Biden. This supports our calculations from earlier, but is also investigated in the later analysis. In some cases we observe a number of interconnected blue nodes, but these hubs might also appear because there are significantly more blue 'Biden' nodes than red nodes, 872 vs 474 respectively.

As seen in the figure, most of the nodes only have a few edges, while some nodes have a large degree. This might correspond to a few popular authors that interact mutually with many of the users. Most of the users only have a single mutual social connection. This is common for a real social network and will be investigated further in the next section.

Degree distribution

As a part of our analysis we investigated the degree distribution of the authors in our Reddit network, by plotting the probability density for the degrees in a histogram. This was also done to see if our distribution is similar to the ones encountered in other social networks.

It is observed that the probability mass is highest for small degrees but drops off rapidly for larger degrees. This has the appearance of a power-law distribution, which is common for real social networks and supports what is mentioned in Chapter 3 of the NS book. As a reference, we compare it to a similar randomly constructed Erdős-Rényi network. The random network contains the same number of nodes and links, and the constant probability of two nodes sharing an edge has been derived from formula 3.3 in Chapter 3 of the NS book.

The degree distribution for the random network is therefore evidently binomial and peaks close to the mean degree of 5.045.

We then plot the degree distributions of our Reddit network and the random network together in a log-scale plot to compare them. As the mean degree of the random network is nearly identical to that of the Reddit network, we only included the mean degree of the Reddit network in the plot.

For the random network, the probability mass is concentrated in a narrower range, with the density dropping off significantly after its peak around degree 5. It is also interesting to observe that no nodes have a degree larger than 13.

On the other hand, the distribution for the real Reddit network is significantly more heavy-tailed, with the largest observed degree being 51. As mentioned earlier, the probability density of the degrees does not follow a Poisson or binomial distribution but instead approximately follows a power-law distribution. All of this was done to further examine the interactions between the users, and we can conclude that the network's degree distribution is of the kind more commonly associated with real social networks and differs significantly from that of a random network, which is intuitive.

Next we calculated the clustering coefficient for our network. This measures the density of links in node $i$'s immediate neighborhood on a scale from 0 to 1: $C_i = 0$ means that there are no links between $i$'s neighbors, while $C_i = 1$ implies that all of $i$'s neighbors link to each other. It therefore tells us whether an author just interacted with a number of "separated" users or whether the hub is more akin to a "community" where the neighbors also reply to each other. The clustering coefficient is first calculated exactly, using the left-hand side of formula 3.21 in Chapter 3 of the NS book via the method provided in the networkx package. It was then compared with the approximation $C_i = \langle k \rangle / N$, which works well for random networks.

We observe that the real clustering coefficient of the network is significantly higher than the approximation from the right-hand side of formula 3.21, $C_i = \langle k \rangle / N$. This also supports the theory mentioned in Chapter 3 of the NS book. However, the value itself is relatively low, which supports the hypothesis that a few active authors have a large number of connections while their neighbors are often not connected.

For the random network, the clustering coefficient is, first of all, significantly lower, and the approximation is also much closer to the real clustering coefficient. All of this signifies that our network follows the properties of a real social network and not those of a random network.

This means that an author can on average reach any other author in around 4 or 5 links (if they are connected at all). We therefore assume that our network exhibits the 'small world' property.

Community Analysis and statistical tests

We wanted to investigate whether authors with the conviction 'Trump' or 'Biden' can be separated into two different communities within the graph. We use modularity as a measure to evaluate the partitioning. It measures how much the network deviates from the expected number of edges between nodes in a community if all of its wirings had been constructed randomly. Thus, a higher modularity for a given network partition indicates a better community structure, meaning that the densely connected individual communities are only sparsely connected to one another.

Based on Chapter 9 of the NS book, the connectivity between the nodes in the communities is therefore assumed to be close to random and explained by the degree distribution. Additionally, it is not possible to find a good cut with few connections that separates the 'Trump' and 'Biden' nodes. It is therefore unlikely that distinct 'Trump' and 'Biden' communities exist in the graph. This means that users with different political convictions interact often, which supports our previous calculations.

We then use the Louvain algorithm to find the best partitioning of the graph. This is in order to examine the communities that exist in the network and their composition of 'Trump' and 'Biden' nodes.

The Louvain algorithm found 24 communities in the network with a modularity of 0.537. This suggests a stronger community structure and makes it more unlikely that the edges between the nodes in these communities were created at random.

We then visualise the largest community using the Netwulf package.

We observe that the community contains a large number of both red and blue nodes and that they are relatively interconnected. This suggests that users often interact with others who do not share their political beliefs, which also supports the general trend observed earlier.

Lastly, we examined whether the found modularities were statistically significantly different from 0. We did this by creating 1000 randomized versions of the Reddit network using the DSE algorithm. We then started by computing the modularity of the "Trump/Biden" split for each of them.
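Assuming the DSE algorithm refers to degree-preserving double edge swaps, the null-model procedure can be sketched as follows (the graph and the two-way split below are toy stand-ins for the Reddit network and the Trump/Biden partition, and only 50 rewired copies are generated instead of 1000):

```python
import networkx as nx
from networkx.algorithms.community import modularity

# Stand-ins for the real data: a random graph instead of the Reddit
# network, and an arbitrary two-way split instead of the Trump/Biden one.
G = nx.erdos_renyi_graph(60, 0.1, seed=1)
split = [set(range(30)), set(range(30, 60))]

null_modularities = []
for i in range(50):  # the project used 1000 rewired copies
    R = G.copy()
    # degree-preserving rewiring via double edge swaps
    nx.double_edge_swap(R, nswap=2 * R.number_of_edges(),
                        max_tries=10**5, seed=i)
    null_modularities.append(modularity(R, split))
```

Comparing the observed modularity against this empirical null distribution then tells us whether the split is significantly better than chance.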

After the random rewiring, the mean modularity is centered around 0, so the splits do not represent potential communities. This is also expected for a randomly wired network, as described in the theory.

Looking at the plot, the modularity of the Trump/Biden split lies within the empirical modularity distribution for the randomly rewired models. This indicates that the edges between the 'Trump' and 'Biden' communities are explained by the degree distribution and that the connectivity is close to that of a random graph. The Trump/Biden split is therefore not a good partitioning, as its modularity is not statistically significantly different from 0.

We then run the same experiment for the Louvain partitioning:

As seen in the plot, none of the 1000 randomly configured networks had a modularity higher than the one observed in the split suggested by the Louvain algorithm. This suggests there are fewer edges between the Louvain communities than one would expect in a randomly connected graph. Therefore the split calculated by the Louvain algorithm is a good partitioning, and we can conclude that its modularity is statistically significantly different from 0.

4 Discussion

What went well?

We're quite happy with our choice of topic for the project. Also, it was great to work with Github Pages, which was a first for all members of the group.

What is missing, and what could be improved?

It would undoubtedly have been interesting to investigate the political content on r/politics over a longer time period, e.g. the six months preceding election day, which would allow for the detection of longer-term trends in redditor activity and sentiment. For such a scope to be feasible, a larger commitment of computational power would be required than the project had available. Along the same lines, incorporating other US politics subreddits into the analysis could give a broader description, as well as more depth on edge cases of extremism on either side of the political spectrum.

Since the submissions on r/politics are links to articles from many news sources, magazines, etc., one could delve into the specific sentiments used in the articles and how they relate to the sentiments of the comments, e.g. is a Fox News article surrounded primarily by Trump supporters or by his critics? This would expand the scope of the research quite a bit.

Furthermore, it would be advisable to improve the text analysis with regard to the contents of the comments in a more informed way, for instance by accounting for bias, sarcasm, et cetera.

Another possibility would be to create a model through bi-grams and Naïve-Bayes classification in order to predict polling data using machine learning.